feat(simd_avx2): add U8x32 — native AVX2 byte vector (round-3 keystone) #144
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b538025f26
```rust
#[inline(always)]
pub fn mask_blend(mask: Self, a: Self, b: Self) -> Self {
    // SAFETY: AVX2 baseline.
    Self(unsafe { _mm256_blendv_epi8(b.0, a.0, mask.0) })
}
```
Preserve U8x64 blend argument order
`U8x64::mask_blend(mask, a, b)` and the scalar fallback select `b` when the mask bit is set, but this AVX2 implementation passes the operands to `_mm256_blendv_epi8` in the opposite order, so set mask lanes return `a`. Any consumer porting code from the existing `U8x64` API, or feeding a comparison mask as a predicate, will get the inverse result on every selected byte. Call `_mm256_blendv_epi8(a.0, b.0, mask.0)` instead, or the new type will not mirror the documented API shape.
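The argument-order point can be checked against a scalar model of the blendv semantics. This is an illustrative sketch (names hypothetical), not code from the PR:

```rust
// Scalar model of _mm256_blendv_epi8 semantics: for each byte lane,
// if the mask byte's MSB is set, the result takes the byte from the
// SECOND operand; otherwise from the FIRST. To honor the U8x64
// contract "mask set selects b", the call must therefore be
// _mm256_blendv_epi8(a.0, b.0, mask.0): a first, b second.
fn blendv_epi8_model(a: &[u8; 32], b: &[u8; 32], mask: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = if mask[i] & 0x80 != 0 { b[i] } else { a[i] };
    }
    out
}

fn main() {
    let a = [0x11u8; 32];
    let b = [0x22u8; 32];
    let mut mask = [0u8; 32];
    mask[0] = 0xFF; // MSB set in lane 0 only
    let r = blendv_epi8_model(&a, &b, &mask);
    assert_eq!(r[0], 0x22); // selected lane comes from b
    assert_eq!(r[1], 0x11); // unselected lanes come from a
    println!("{:#04x} {:#04x}", r[0], r[1]);
}
```

Swapping the first two intrinsic operands inverts the selection on every lane with a set MSB, which is exactly the bug described above.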
The keystone for the cosmetic-SIMD sweep that agent #11 audited on PR #142. That audit found 8 confirmed cosmetic SIMD wrappers in `hpc/byte_scan.rs`, `hpc/palette_codec.rs`, and `hpc/aabb.rs` — `#[target_feature(enable = "avx2")]` decorating scalar bodies that gave zero speedup over plain scalar. The root cause: there was no `U8x32` type in the polyfill, so consumers couldn't write SIMD byte code at AVX2's natural width (32 bytes = one `__m256i` ymm register).

This PR adds U8x32 with real `__m256i` storage and 26 polyfill methods mirroring `simd_avx512::U8x64`:

- Constructors: `splat`, `from_slice`, `from_array`, `to_array`, `copy_to_slice`
- Reductions: `reduce_sum` (wrap-add), `reduce_min`, `reduce_max`, `sum_bytes_u64`
- Min/max: `simd_min`, `simd_max` (`_mm256_min_epu8`, `_mm256_max_epu8`)
- Compare→mask: `cmpeq_mask → u32`, `cmpgt_mask → u32` (unsigned via xor 0x80), `movemask → u32` (matches `_mm256_movemask_epi8` width)
- Saturating: `saturating_add`, `saturating_sub` (`_mm256_adds/subs_epu8`)
- Avg: `pairwise_avg` (`_mm256_avg_epu8`, round-up)
- Shifts: `shr_epi16`, `shl_epi16` (16-bit lane shifts via `_mm256_srl/sll_epi16`)
- Shuffles: `shuffle_bytes` (within-128-bit-lane, `_mm256_shuffle_epi8`); `permute_bytes` (cross-lane, scalar fallback — AVX2 has no native cross-lane byte permute; matches U8x64's behavior on AVX-512F-without-VBMI hosts); `unpack_lo_epi8`, `unpack_hi_epi8` (`_mm256_unpacklo/hi_epi8`)
- Conditional: `mask_blend` (`_mm256_blendv_epi8`, MSB-driven, NOT bitmask)
- LUT: `nibble_popcount_lut`
- Operators: `BitAnd`, `BitOr`, `BitXor`, `Add` (wrapping), `Sub` (wrapping), plus `Debug` and `Default`

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build tiers — U8x32 is the natural AVX2 byte width and is needed regardless of whether AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that AVX2 is the compile target (x86-64-v3 is the project baseline). The codebase already uses this pattern in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max reductions, unsigned cmp masks (incl. high-byte > 127 to verify the xor-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg round-up, shr_epi16 nibble extraction, permute_bytes reverse, mask_blend per-MSB selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.
- `clippy --features rayon -- -D warnings`: clean.

Companion: this PR unblocks the round-3 consumer fleet, which will rewrite `byte_find_all_avx2` / `pack_indices` / `aabb_intersect_batch_sse41` and friends to use `crate::simd::U8x32` instead of `#[target_feature]` wrappers around scalar code. Each consumer rewrite ships as its own PR in the next wave.
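A side note on the xor-0x80 trick that the high-byte tests exercise: a scalar model (illustrative, not the PR's code) shows why biasing both operands by 0x80 turns AVX2's signed byte compare into an unsigned one:

```rust
// AVX2 only has a signed byte compare (_mm256_cmpgt_epi8), so an
// unsigned a > b is computed as (a ^ 0x80) > (b ^ 0x80) in signed
// arithmetic. XOR-ing 0x80 flips the top bit, mapping the unsigned
// range 0..=255 monotonically onto the signed range -128..=127.
fn cmpgt_epu8_model(a: u8, b: u8) -> bool {
    ((a ^ 0x80) as i8) > ((b ^ 0x80) as i8)
}

fn main() {
    // High-byte case a naive signed compare gets wrong:
    // 200 > 100 unsigned, but 200 as i8 == -56, which is < 100 signed.
    assert!(cmpgt_epu8_model(200, 100));
    assert!((200u8 as i8) < 100i8); // the naive signed compare inverts it
    // Exhaustive check against the unsigned truth table.
    for a in 0..=255u8 {
        for b in 0..=255u8 {
            assert_eq!(cmpgt_epu8_model(a, b), a > b);
        }
    }
    println!("xor-0x80 trick matches unsigned > on all 65536 pairs");
}
```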
Force-pushed b538025 to 521d23f (Compare)
…mt fix) The format/nightly CI job on PR #144 flagged two sites in the U8x32 additions:
1. `nibble_popcount_lut` — the 32-byte literal was split into two 16-element rows for readability. Nightly rustfmt's `chains_overflow_last_block` + width budget collapse it to one line. Restored.
2. `permute_bytes_reverse` test — a 3-method chain was on one line. Nightly rustfmt wants each `.method()` on its own line under `chain_width = 60`. Restored.

No semantic change. `cargo +nightly fmt --all --check` clean after.
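For context on what that 32-byte literal encodes: `nibble_popcount_lut` is the classic shuffle-based popcount, where a 16-entry table (one row per 128-bit lane when splatted to 32 bytes) is indexed by each nibble. A scalar model, illustrative only and not the PR's code:

```rust
// A 16-entry LUT holds popcount(0..=15); a byte's popcount is
// lut[low nibble] + lut[high nibble]. On AVX2 the two lookups are
// done with _mm256_shuffle_epi8 against the splatted 32-byte table.
const NIBBLE_POPCOUNT: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];

fn popcount_byte_via_lut(x: u8) -> u8 {
    NIBBLE_POPCOUNT[(x & 0x0F) as usize] + NIBBLE_POPCOUNT[(x >> 4) as usize]
}

fn main() {
    // Cross-check the LUT path against the hardware popcount.
    for x in 0..=255u8 {
        assert_eq!(popcount_byte_via_lut(x), x.count_ones() as u8);
    }
    println!("LUT popcount matches count_ones for all 256 bytes");
}
```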
Summary
Round-3 W1 (foundation) — adds U8x32, the native AVX2 byte vector type that the round-2 audit fleet (agent #11) identified as the keystone gap. Without it, the 8 cosmetic-SIMD wrappers in `hpc/byte_scan.rs`, `hpc/palette_codec.rs`, and `hpc/aabb.rs` (`#[target_feature(enable = "avx2")]` decorating scalar bodies) couldn't be rewritten to use the polyfill — there was no type to call.

This PR ships the type + 26 methods + 18 regression tests. The consumer rewrites (cosmetic-SIMD sweep) ship in a follow-up fleet of PRs.
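The consumer shape those rewrites target can be sketched with a scalar stand-in. All names here are hypothetical, and the inner loop models what a `cmpeq_mask`/`movemask`-style u32 bitmask per 32-byte chunk would look like; the real code would use `U8x32::splat` and `cmpeq_mask` over `__m256i` chunks:

```rust
// Scan a haystack in 32-byte chunks, build a u32 match bitmask per
// chunk (bit i set iff chunk[i] == needle), then walk the set bits.
fn find_all_byte(haystack: &[u8], needle: u8) -> Vec<usize> {
    let mut hits = Vec::new();
    let mut chunks = haystack.chunks_exact(32);
    let mut base = 0usize;
    for chunk in &mut chunks {
        // Model of cmpeq_mask: one bit per byte lane.
        let mut mask: u32 = 0;
        for (i, &b) in chunk.iter().enumerate() {
            mask |= ((b == needle) as u32) << i;
        }
        // Walk set bits lowest-first, as the SIMD version would.
        while mask != 0 {
            hits.push(base + mask.trailing_zeros() as usize);
            mask &= mask - 1; // clear lowest set bit
        }
        base += 32;
    }
    // Scalar tail for the final partial chunk.
    for (i, &b) in chunks.remainder().iter().enumerate() {
        if b == needle {
            hits.push(base + i);
        }
    }
    hits
}

fn main() {
    let mut data = vec![0u8; 70];
    data[3] = 7;
    data[33] = 7;
    data[69] = 7;
    assert_eq!(find_all_byte(&data, 7), vec![3, 33, 69]);
    println!("{:?}", find_all_byte(&data, 7));
}
```

The chunk loop is where the wrapper-to-polyfill rewrites change: the per-byte compare collapses to one vector compare plus one movemask.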
API surface
`U8x32` mirrors `simd_avx512::U8x64` at the natural AVX2 width (32 bytes = one `__m256i`):
- `splat`, `from_slice`, `from_array`, `to_array`, `copy_to_slice`
- `reduce_sum` (wrap-add), `reduce_min`, `reduce_max`, `sum_bytes_u64`
- `simd_min`, `simd_max`
- `cmpeq_mask → u32`, `cmpgt_mask → u32`, `movemask → u32`
- `saturating_add`, `saturating_sub`
- `pairwise_avg` (round-up, `_mm256_avg_epu8`)
- `shr_epi16`, `shl_epi16` (16-bit lane shifts)
- `shuffle_bytes` (within-lane), `permute_bytes` (cross-lane scalar fallback), `unpack_lo_epi8`, `unpack_hi_epi8`
- `mask_blend` (MSB-driven, NOT bitmask — distinct from `U8x64::mask_blend`)
- `nibble_popcount_lut`
- `BitAnd`, `BitOr`, `BitXor`, `Add` (wrap), `Sub` (wrap), `Debug`, `Default`

Design notes
Real `__m256i` storage, not scalar fallback. The existing `simd_avx2::U8x64` uses scalar `[u8; 64]` because 64 bytes doesn't fit one ymm register — it's a polyfill-shape compatibility type. `U8x32` is the real AVX2 byte vector; consumers wanting actual AVX2 SIMD speedup over scalar should iterate in 32-byte chunks via U8x32.

Soundness model matches the file's existing pattern: `_mm256_*` intrinsics are wrapped in `unsafe { }` inside safe `pub fn`, trusting AVX2 at compile time (project baseline is x86-64-v3 per the `.cargo/config_ndarray_simd.toml` template on the bevy side). The existing `simd_avx2.rs:357` AVX2 popcount uses this same pattern.

`permute_bytes` cross-lane fallback is scalar. AVX2 has no native cross-lane byte permute (`_mm256_permutexvar_epi8` is VBMI/AVX-512). Matches the shape of `simd_avx512::U8x64::permute_bytes` on AVX-512F-without-VBMI hosts (also scalar — landed in PR #142).

`mask_blend` differs from U8x64's: AVX-512 uses 64-bit bitmasks; AVX2's `_mm256_blendv_epi8` is MSB-driven on a 32-byte mask vector. The signature reflects this (`Self, Self, Self`, not `u64, Self, Self`).

Re-exports
`U8x32` is re-exported from `crate::simd::*` on both AVX-512 and AVX2 build tiers — it's the natural AVX2 byte width and useful regardless of whether AVX-512's U8x64 is the consumer's preferred width.

Test plan
- `mod u8x32_tests` covering all behavior (constructors, reductions, masks incl. unsigned high-byte cases > 127, saturating clamps, pairwise_avg round-up, nibble shifts, permute, blend, LUT)
- `cargo clippy --features rayon -- -D warnings`: clean
- `cargo check --features rayon --lib`: clean

What follows
The cosmetic-SIMD consumer-rewrite fleet — 12 Sonnet agents, each scoped to ONE function in `hpc/byte_scan.rs` / `hpc/palette_codec.rs` / `hpc/aabb.rs`, rewriting from `#[target_feature]` + scalar to `crate::simd::U8x32::*` calls. Will land as separate PRs after this one merges.

Generated by Claude Code